
feat: porting ngsderive #35

Open · a-frantz wants to merge 91 commits into main
Conversation

@a-frantz (Member) commented Dec 4, 2023

Porting https://github.com/stjudecloud/ngsderive to this repo.

An attempt was made to keep the algorithms in this port as consistent with the original repo as possible. However, some tweaks and improvements were made, which are detailed below. The biggest difference between the Python code and the Rust code is the output: ngsderive reported TSVs with fairly limited information, while this code reports results as JSON and outputs comprehensive metrics related to processing, making debugging and assessment easier. I'm not going to detail all the new metrics here; check out the code for that.

  • encoding
    • no significant changes
  • endedness
    • Started checking the is_segmented bitwise flag, which led to a rework of Single-End classification (see the endedness sketch after this list).
      • if is_segmented == false, we DO NOT check is_first or is_last
      • if all reads have is_segmented == false, we call the data Single-End
        • (the optional condition of RPT == 1.0 for classifying as Single-End still applies)
      • we no longer expect SE data to have BOTH the first and last flags set
      • Paired-End logic is essentially the same, but now we require that every read has is_segmented set
    • In Rust, we aren't using any fancy data structure for QNAMEs (in Python we use a prefix tree). I tried a prefix tree in Rust and it horribly tanked performance; I don't know why, since in theory it should help. I gave up on optimizing this command; we can always come back to it.
  • instrument:
    • Was already implemented and is not significantly changed by this PR, but there were some tweaks so it behaves similarly to the other (new) derive commands.
    • now reports per-RG results
    • On paper, this was a multi-threaded implementation. That really confused me, because it isn't actually multi-threaded. So I stripped away the async wrapping and made it clear that it's single-threaded.
    • I added some additional metric reporting.
    • I added a record counter that logs every 1 million reads (a minimal sketch follows this list).
    • No other changes of import.
  • junction-annotation:
    • The biggest change, which also impacted strandedness: MAPQ filtering.
      • This Simon Andrews article is a good rundown of the various ways MAPQs are (erroneously) used in bioinformatics: https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/
      • Notably, STAR sets a MAPQ of 255 for unique alignments, while the spec (and therefore noodles) interprets 255 as the special value meaning "MAPQ is missing".
      • Because of this, noodles never returns a MAPQ of 255; pysam (in Python) does return values of 255.
      • In ngsderive, the default MAPQ filter is 30. On our STAR data, this meant "only look at unique mappers (discard multi-mappers)". That doesn't work in Rust when we're using noodles.
      • My resolution was to default to "no MAPQ filter", which in practice on STAR data means looking at both unique mappers and multi-mappers.
      • If a user wants to enable a MAPQ filter, they can; it will just block any records with a raw MAPQ of 255 (see the MAPQ sketch after this list).
    • This implementation readily lends itself to being adopted by the junction-saturation algorithm (unlike the Python code, which was complicated by a reliance on pysam), so we can implement that pretty easily in the future.
  • readlen:
    • nothing significant
  • strandedness:
    • Same MAPQ filter problem and solution as junction-annotation.
    • As I wrote this, I discovered more and more problems with the Python code, some big, some small, but hopefully all "fixed" here.
    • I think this had the largest impact on results: there was no (working) filter for genes with mixed-strand exons. There was code for disqualifying these cases; it just didn't work in Python.
    • Now, every gene is checked to be on either the Forward or Reverse strand, and then all overlapping exons are checked to be on that same strand (see the strand-filter sketch after this list). This check disqualifies roughly half of the protein-coding genes in Gencode release v32 (what I tested with). That might sound like a lot of genes to disqualify, but I think it checks out for two reasons.
      1. Results are significantly better with this filter than without. I wasn't rigorous and didn't do any stats, but by my eyeball, with the strand filter I was getting results in the >99% reverse-evidence range, while without it the same samples returned reverse evidence in the ~90% range.
      2. A Nature article gets similar-ish numbers with a similar-ish analysis: https://www.nature.com/articles/s41598-019-49802-w#:~:text=Accordingly%2C%20we%20used%20the%20current,overlapped%20with%20their%20adjacent%20genes (discovered courtesy of Andrew T.).
    • Another filter that didn't work in Python: --min-reads-per-gene. Everything was being counted as evidence, so the filter didn't do anything; it now works. I didn't notice any swing in the results when I played with disabling it (at least at a value of 10 vs. 0), so I don't know how much it really impacts results. Maybe jacking it up would change things? I haven't investigated.
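
For reference, here's a minimal sketch of the reworked endedness call described above. The counter struct and function are hypothetical (they stand in for this PR's actual bookkeeping), and the flag names mirror SAM flag semantics rather than the exact noodles API:

```rust
/// Hypothetical tallies gathered while iterating over records.
struct FlagCounts {
    segmented: u64,   // reads with is_segmented set
    unsegmented: u64, // reads with is_segmented unset
    first: u64,       // is_first (only inspected when is_segmented is set)
    last: u64,        // is_last (only inspected when is_segmented is set)
    both: u64,        // is_first && is_last
    neither: u64,     // !is_first && !is_last
}

/// Returns the endedness call, or `None` if the evidence is inconclusive.
fn call_endedness(c: &FlagCounts, reads_per_template: Option<f64>) -> Option<&'static str> {
    // Single-End: every read is unsegmented; is_first/is_last are never
    // consulted. RPT == 1.0 is an optional extra condition.
    if c.segmented == 0 && c.unsegmented > 0 {
        if reads_per_template.map_or(true, |rpt| rpt == 1.0) {
            return Some("Single-End");
        }
        return None;
    }
    // Paired-End: every read is segmented, the first/last counts balance,
    // and no read carries both or neither of the first/last flags (the
    // balance conditions are carried over from the original ngsderive logic
    // as I understand it).
    if c.unsegmented == 0 && c.first == c.last && c.both == 0 && c.neither == 0 {
        return Some("Paired-End");
    }
    None
}
```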
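
The record counter mentioned under instrument is simple enough to sketch in a few lines. Logging via the log crate is an assumption here; any logging facility works:

```rust
/// A minimal progress counter that logs a line every 1,000,000 records.
struct RecordCounter {
    count: u64,
}

impl RecordCounter {
    fn new() -> Self {
        Self { count: 0 }
    }

    /// Call once per record processed.
    fn inc(&mut self) {
        self.count += 1;
        if self.count % 1_000_000 == 0 {
            log::info!("Processed {} records.", self.count);
        }
    }
}
```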
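
Here's a sketch of the MAPQ filter behavior described under junction-annotation. Since noodles surfaces a raw MAPQ of 255 as a missing value, Option<u8> models it here; the function name and signature are illustrative, not the PR's actual API:

```rust
/// Decide whether a record passes the (optional) MAPQ filter.
///
/// `mapq` is `None` when the raw value was 255 (the "missing" sentinel),
/// which is how noodles surfaces it. `min_mapq` is `None` when no filter
/// is configured, which is the new default.
fn passes_mapq_filter(mapq: Option<u8>, min_mapq: Option<u8>) -> bool {
    match min_mapq {
        // Default: no filter. Unique mappers and multi-mappers both pass,
        // including records whose MAPQ is the missing sentinel.
        None => true,
        // Filter enabled: a missing MAPQ (raw 255) can't be compared against
        // the threshold, so it is blocked.
        Some(min) => matches!(mapq, Some(q) if q >= min),
    }
}
```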
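
Finally, a sketch of the strandedness strand-consistency filter, with hypothetical types standing in for the PR's actual gene/exon handling:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Strand {
    Forward,
    Reverse,
}

/// Hypothetical stand-ins for the parsed GFF features.
struct Gene {
    strand: Option<Strand>, // `None` for unstranded or ambiguous entries
}

struct Exon {
    strand: Option<Strand>,
}

/// A gene is kept only if it has an unambiguous strand AND every overlapping
/// exon agrees with that strand. `overlapping_exons` is assumed to come from
/// an interval lookup elsewhere in the command.
fn gene_is_usable(gene: &Gene, overlapping_exons: &[Exon]) -> bool {
    match gene.strand {
        Some(strand) => overlapping_exons
            .iter()
            .all(|exon| exon.strand == Some(strand)),
        None => false, // mixed/unknown strand: disqualified
    }
}
```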

Before submitting this PR, please make sure:

  • You have added a few sentences describing the PR here.
  • You have added yourself or the appropriate individual as the assignee.
  • You have added at least one relevant code reviewer to the PR.
  • You have added any relevant tags to the pull request.
  • Your code builds clean without any errors or warnings (use cargo test and cargo clippy).
  • You have added tests (when appropriate).
  • You have updated the wiki (when appropriate).
  • You have updated the README or other documentation to account for these changes (when appropriate).

@a-frantz self-assigned this Dec 4, 2023
@a-frantz added the feature (Introduces a new feature to the codebase) label Dec 4, 2023
@a-frantz marked this pull request as ready for review February 22, 2024 17:10
@claymcleod (Member) left a comment

Some intermediate comments (some are old) from my last review. I want to get these in front of you so you can address them before squashing or whatever you plan to do with the commits here.

```diff
@@ -20,6 +25,25 @@ pub struct DeriveArgs {
 /// All possible subcommands for `ngs derive`.
 #[derive(Subcommand)]
 pub enum DeriveSubcommand {
     /// Derives the quality score encoding used to produce the file.
     Encoding(self::encoding::DeriveEncodingArgs),
```
FYI, we're going to change how this is named slightly in the future.
